English and Chinese Bilingual Topic Aspect Classification: Exploring Similarity Measures, Optimal LSA Dimensions, and Centroid Correction of Translated Training Examples
Abstract
This paper explores topic aspect (i.e., subtopic or facet) classification for collections that contain more than one language (in this case, English and Chinese), and investigates several key technical issues that may affect classification effectiveness. The evaluation model assumes a bilingual user who has found some documents on a topic and has identified, in each language, a few passages on specific aspects of that topic that are of interest. Additional passages are then labeled automatically using a k-Nearest-Neighbor classifier and local (i.e., result set) Latent Semantic Analysis (LSA). Experiments show that when few manually annotated passages are available in either language, a classification system trained on passages from both languages can often achieve higher effectiveness than a similar system trained on passages from just one language. Using this experimental framework, the paper answers three technical research questions: whether the normalized cosine similarity measure is better than the more common unnormalized cosine similarity measure (yes), whether the heuristically chosen number of retained LSA dimensions is appropriate (yes), and whether partial correction of the translated training examples in the LSA space yields an improvement over no correction (no).
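The labeling step described in the abstract (project passages into an LSA space, then assign aspects with a k-Nearest-Neighbor vote over cosine similarity) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the toy term-by-passage counts, the aspect labels "A" and "B", the helper names `lsa_project`, `cosine`, and `knn_label`, the choice of 2 retained dimensions, and the similarity-weighted vote. It uses plain cosine similarity; the abstract does not define the paper's normalized variant, so that is not reproduced here.

```python
import numpy as np

def lsa_project(term_doc, dims):
    """Project passages (columns of term_doc) into a dims-dimensional LSA space."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row i of the result is passage i expressed in the reduced space.
    return (np.diag(s[:dims]) @ vt[:dims]).T

def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def knn_label(query, train_vecs, train_labels, k=3):
    """Label a query passage by a similarity-weighted vote of its k nearest neighbors."""
    sims = [cosine(query, v) for v in train_vecs]
    nearest = np.argsort(sims)[::-1][:k]
    votes = {}
    for i in nearest:
        votes[train_labels[i]] = votes.get(train_labels[i], 0.0) + sims[i]
    return max(votes, key=votes.get)

# Hypothetical term-by-passage counts: rows are terms, columns are passages.
# Passages 0-1 are annotated with aspect "A", passages 2-3 with aspect "B",
# and passage 4 is the unlabeled passage to classify.
term_doc = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 2, 1, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 1, 1, 0],
], dtype=float)

passages = lsa_project(term_doc, dims=2)  # 2 retained dimensions (assumed)
train_vecs, train_labels = passages[:4], ["A", "A", "B", "B"]
predicted = knn_label(passages[4], train_vecs, train_labels, k=3)
print(predicted)
```

In the bilingual setting the abstract describes, the training columns would include passages from both languages, with one language translated so that all passages share a vocabulary. The toy matrix above does not model that step.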
Related Resources
Classifying Attitude by Topic Aspect for English and Chinese Document Collections
Title: Classifying Attitude by Topic Aspect for English and Chinese Document Collections Yejun Wu, Doctor of Philosophy, 2008 Dissertation directed by: Professor Douglas W. Oard College of Information Studies & Institute for Advanced Computer Studies, UMCP The goal of this dissertation is to explore the design of tools to help users make sense of subjective information in English and Chinese by...
Bilingual-LSA Based LM Adaptation for Spoken Language Translation
We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework crosslingual LM adaptation can be performed by, first, in...
Searching a Russian Document Collection using English, Chinese and Japanese Queries
As in CLEF 2003, Berkeley experimented with the CLEF Russian Izvestia document collection with monolingual and bilingual runs for the Russian collection. For CLEF 2004 we also experimented with Chinese and Japanese as topic languages, using English as the ‘pivot’ language. For bilingual retrieval our approaches were query translation (for English as a topic language) and ‘fast’ document transla...
Word Sense Disambiguation Using Automatically Translated Sense Examples
We present an unsupervised approach to Word Sense Disambiguation (WSD). We automatically acquire English sense examples using an English-Chinese bilingual dictionary, Chinese monolingual corpora and Chinese-English machine translation software. We then train machine learning classifiers on these sense examples and test them on two gold standard English WSD datasets, one for binary and the other...
Bilingual Co-Training for Sentiment Classification of Chinese Product Reviews
The lack of reliable Chinese sentiment resources limits research progress on Chinese sentiment classification. However, there are many freely available English sentiment resources on the Web. This article focuses on the problem of cross-lingual sentiment classification, which leverages only available English resources for Chinese sentiment classification. We first investigate several basic meth...
Publication date: 2013